Presentable Document Format: Improved On-demand PDF to HTML Conversion

Author

  • Ralph Sommerer
Abstract

Search engines such as Google and MSN Search crawl and index files in Adobe’s Portable Document Format (PDF) alongside material in HTML. Google furthermore offers a "View as HTML" option for PDF files that includes query-term highlighting. The visual appearance of these HTML files converted from PDF is very poor. In this paper we claim that significant improvements to the quality of on-demand PDF-to-HTML conversion can be achieved at insignificant cost in terms of increased file size and processing time. In particular, we show that slightly more sophisticated HTML coding can easily compensate for the increase in file size incurred by including line graphics and images.


Similar resources

PDF2XML: Converting PDF to XML

XML is a markup language for documents containing structured information. It is designed to make it easy to interchange structured documents over the Internet and further integrate them with management database system. PDF is a document format intended to electronically reproduce the look of a page. There is a huge demand of converting existing PDF documents into XML documents, so that they wil...


From Legacy Documents to XML: A Conversion Framework

We present an integrated framework for the document conversion from legacy formats to XML format. We describe the LegDoC project, aimed at automating the conversion of layout annotations layout-oriented formats like PDF, PS and HTML to semantic-oriented annotations. A toolkit of different components covers complementary techniques the logical document analysis and semantic annotations with the ...


Tagged mathematics in PDFs for accessibility and other purposes

PDF has been the preferred format for publishing mathematics for many years now. With changes to methods of delivery (i.e., electronic rather than predominantly paper) there need to be corresponding enhancements in the document format. Not least among these can be implicit legal obligations to satisfy Accessibility criteria. The answer developed for PDF is tagging of document structure and cont...


Research and Realization about Conversion Algorithm of PDF Format into PS Format

This paper firstly introduces the characteristics of PostScript document and PDF document as the basis, and proposes the necessity and the feasibility of the conversion from the PDF document format to the PostScript language program. Secondly, it studies the main algorithm and technology of the conversion process and realizes the information extraction for PDF document lastly, with achieving th...


Oncogene pdf

Oncogene pdf Advances in science have improved our knowledge of the inner workings of cells, the basic building blocks of the body.viral oncogenes. proto oncogene pdf The latter were previously characterized as the specific genetic elements capable of conferring the tumorigenic properties to the ribonucleic.Describe how the HER2neu oncogene is activated in breast cancer. oncogene addiction pdf ...


Publication date: 2004